Abstract

Given a point set S and an unknown metric d on S, we study the problem of efficiently 
partitioning S into k clusters while querying few distances between the points. In our model 
we assume that we have access to one versus all queries that given a point $s \in S$ return
the distances between s and all other points. We show that given a natural assumption 
about the structure of the instance, we can efficiently find an accurate clustering using only 
O(k) distance queries. Our algorithm uses an active selection strategy to choose a small 
set of points that we call landmarks, and considers only the distances between landmarks 
and other points to produce a clustering. We use our procedure to cluster proteins by 
sequence similarity. This setting nicely fits our model because we can use a fast sequence 
database search program to query a sequence against an entire dataset. We conduct an 
empirical study that shows that even though we query a small fraction of the distances 
between the points, we produce clusterings that are close to a desired clustering given by 
manual classification. 

Keywords: clustering, active clustering, k-median, approximation algorithms, approximation
stability, clustering accuracy, protein sequences
1. Introduction



Clustering from pairwise distance information is an important problem in the analysis and
exploration of data. It has many variants and formulations, it has been studied extensively
in many different communities, and many different clustering algorithms have been
proposed.

Many application domains ranging from computer vision to biology have recently faced 
an explosion of data, presenting several challenges to traditional clustering techniques. In 
particular, computing the distances between all pairs of points, as required by traditional 
clustering algorithms, has become infeasible in many application domains. As a consequence 
it has become increasingly important to develop effective clustering algorithms that can 
operate with limited distance information. 

In this work we initiate a study of clustering with limited distance information; in 
particular we consider clustering with a small number of one versus all queries. We can 
imagine at least two different ways to query distances between points. One way is to ask for 
distances between pairs of points, and the other is to ask for distances between one point 
and all other points. Clearly, a one versus all query can be implemented as $|S|$ pairwise
queries, but we draw a distinction between the two because the former is often significantly
faster in practice if the query is implemented as a database search.

Our main motivating example for considering one versus all distance queries is sequence 
similarity search in biology. A program such as BLAST (Altschul et al., 1990) (Basic Local
Alignment Search Tool) is optimized to search a single sequence against an entire database
of sequences. On the other hand, performing $|S|$ pairwise sequence alignments takes several
orders of magnitude more time, even if the pairwise alignment is very fast. The disparity 
in runtime is due to the hashing that BLAST uses to identify regions of similarity between 
the input sequence and sequences in the database. The program maintains a hash table of 
all words in the database (substrings of a certain length), linking each word to its locations. 
When a query is performed, BLAST considers each word in the input sequence, and runs 
a local sequence alignment in each of its locations in the database. Therefore the program 
only performs a limited number of local sequence alignments, rather than aligning the input 
sequence to each sequence in the database. Of course, the downside is that we never consider 
alignments between sequences that do not share a word. However, in this case an alignment 
may not be relevant anyway, and we can assign a distance of infinity to the two sequences. 
Even though the search performed by BLAST is heuristic, it has been shown that protein 
sequence similarity identified by BLAST is meaningful (Brenner et al. 1998). 
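
The word-index idea described above can be illustrated with a toy sketch. This is not BLAST's actual implementation: the function names, the word length of 3, and the example sequences are assumptions made for illustration, and the index stores only sequence identifiers rather than word locations. It shows why a query touches only the sequences that share a word with it.

```python
from collections import defaultdict

def build_word_index(database, word_len=3):
    """Map each substring of length word_len to the ids of sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in database.items():
        for i in range(len(seq) - word_len + 1):
            index[seq[i:i + word_len]].add(seq_id)
    return index

def candidate_hits(query, index, word_len=3):
    """Return ids of database sequences that share at least one word with the query."""
    hits = set()
    for i in range(len(query) - word_len + 1):
        # .get avoids creating empty entries in the defaultdict
        hits |= index.get(query[i:i + word_len], set())
    return hits

# Hypothetical toy database; "a" and "b" share the word "KVL" with the query.
database = {"a": "MKVLAT", "b": "GGSKVL", "c": "PPPPPP"}
index = build_word_index(database)
print(sorted(candidate_hits("TKVLQ", index)))  # ['a', 'b']
```

Sequences that never appear among the candidates (here "c") would be assigned a distance of infinity to the query, as discussed above.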

Motivated by such scenarios, in this paper we consider the problem of clustering a 
dataset with an unknown distance function, given only the capability to ask one versus all 
distance queries. We design an efficient algorithm for clustering accurately with a small 
number of such queries. To formally analyze the correctness of our algorithm we assume 
that the distance function is a metric, and that our clustering problem satisfies a natural 
approximation stability property regarding the utility of the k-median objective function
in clustering the points. In particular, our analysis assumes the $(c, \epsilon)$-property of Balcan
et al. (2009). For an objective function $\Phi$ (such as k-median), the $(c, \epsilon)$-property assumes
that any clustering that is a c-approximation of $\Phi$ has error of at most $\epsilon$. To define what
we mean by error we assume that there exists some unknown relevant "target" clustering
$C_T$; the error of a proposed clustering $C$ is then the fraction of misclassified points under
the optimal matching between the clusters in $C_T$ and $C$.

Our first main contribution is designing an algorithm that given the $(c, \epsilon)$-property for
the k-median objective finds an accurate clustering with probability at least $1 - \delta$ by using
only $O(k + \ln\frac{1}{\delta})$ one versus all queries. In particular, we use the same assumption as
Balcan et al. (2009), and we obtain effectively the same performance guarantees as Balcan
et al. (2009) but by only using a very small number of one versus all queries. In addition to
handling this more difficult scenario, we also provide a much faster algorithm. The algorithm
of Balcan et al. (2009) can be implemented in $O(|S|^3)$ time, while the one proposed here
runs in time $O((k + \ln\frac{1}{\delta})|S| \log |S|)$.

Our algorithm uses an active selection strategy to choose a small set of landmark points. 
We then construct a clustering using only the distances between landmarks and other points. 
The runtime of our algorithm is $O(|L||S| \log |S|)$, where $L$ is the set of landmarks that have
been selected. Our adaptive selection procedure significantly reduces the query and time
complexity of the algorithm. We show that using our adaptive procedure it suffices to
choose only $O(k + \ln\frac{1}{\delta})$ landmarks to produce an accurate clustering with probability at
least $1 - \delta$. Using a random selection strategy we need $O(k \ln\frac{k}{\delta})$ landmarks if the clusters of
the target clustering are balanced in size, and otherwise performance degrades significantly
because we may need to sample points from much smaller clusters.

We use our algorithm to cluster proteins by sequence similarity, and compare our results
to gold standard manual classifications given in the Pfam (Finn et al., 2010) and SCOP
(Murzin et al., 1995) databases. These classification databases are used ubiquitously in
biology to observe evolutionary relationships between proteins and to find close relatives of
particular proteins. We find that for one of these sources we obtain clusterings that usually
closely match the given classification, and for the other the performance of our algorithm 
is comparable to that of the best known algorithms using the full distance matrix. Both 
of these classification databases have limited coverage, so a completely automated method 
such as ours can be useful in clustering proteins that have yet to be classified. Moreover, 
our method can cluster very large datasets because it is efficient and does not require the 
full distance matrix as input, which may be infeasible to obtain for a very large dataset. 

Related Work: A property that is related to the $(c, \epsilon)$-property is $\epsilon$-separability, which
was introduced by Ostrovsky et al. (2006). A clustering instance is $\epsilon$-separated if the cost
of the optimal k-clustering is at most $\epsilon^2$ times the cost of the optimal clustering using $k - 1$
clusters. The $\epsilon$-separability and $(c, \epsilon)$ properties are related: in the case when the clusters
are large the Ostrovsky et al. (2006) condition implies the Balcan et al. (2009) condition
(see Balcan et al., 2009).



Ostrovsky et al. also present a sampling method for choosing initial centers, which
when followed by a single Lloyd-type descent step gives a constant factor approximation
of the k-means objective if the instance is $\epsilon$-separated. However, their sampling method
needs information about the full distance matrix because the probability of picking two
points as two cluster centers is proportional to their squared distance. A very similar
(independently proposed) strategy is used by Arthur and Vassilvitskii (2007) to obtain
an $O(\log k)$-approximation of the k-means objective on arbitrary instances. Their work
was further extended by Ailon et al. (2009) to give a constant factor approximation using
$O(k \log k)$ centers. The latter two algorithms can be implemented with $k$ and $O(k \log k)$
one versus all distance queries, respectively.



Awasthi et al. (2010) have since improved the approximation guarantee of Ostrovsky
et al. (2006) and some of the results of Balcan et al. (2009). In particular, they show a way
to arbitrarily closely approximate the k-median and k-means objective when the Balcan
et al. (2009) condition is satisfied and all the target clusters are large. In their analysis
they use a property called weak deletion-stability, which is implied by the Ostrovsky et al.
(2006) condition and the Balcan et al. (2009) condition when the target clusters are large.
However, in order to find a c-approximation (and given our assumption a clustering that is
$\epsilon$-close to the target) the runtime of their algorithm is $n^{O(1/(c-1)^2)} k^{O(1/(c-1))}$. On the other
hand, the runtime of our algorithm is completely independent of c, so it remains efficient
even when the $(c, \epsilon)$-property holds only for some very small constant c.

Approximate clustering using sampling has been studied extensively in recent years (see
Mishra et al., 2001; Ben-David, 2007; Czumaj and Sohler, 2007). The methods proposed in
these papers yield constant factor approximations to the k-median objective using at least
O(k) one versus all distance queries. However, as the constant factor of these approximations
is at least 2, the proposed sampling methods do not necessarily yield clusterings close to the
target clustering $C_T$ if the $(c, \epsilon)$-property holds only for some small constant $c < 2$, which
is the interesting case in our setting.



Our landmark selection strategy is related to the farthest first traversal used by Dasgupta
(2002). In each iteration this traversal selects the point that is farthest from the ones chosen
so far, where distance from a point $s$ to a set $X$ is given by $\min_{x \in X} d(s, x)$. This traversal
was originally used by Gonzalez (1985) to give a 2-approximation to the k-center problem.
It is used in Dasgupta (2002) to produce a hierarchical clustering where for each k the
induced k-clustering is a constant factor approximation of the optimal k-center clustering.
Our selection strategy is somewhat different from farthest first traversal because in each
iteration we uniformly at random choose one of the furthest points from the ones selected
so far. In addition, the theoretical guarantees we provide are quite different from those of
Gonzalez and Dasgupta.
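
For comparison with our randomized strategy, the deterministic farthest first traversal can be sketched as follows. This is a minimal illustration, not code from the cited works; the function name, the `start` parameter, and the one-dimensional example are assumptions.

```python
def farthest_first(points, dist, k, start=0):
    """Return indices of k centers chosen by farthest first traversal."""
    centers = [start]
    # d_min[i] is the distance from point i to the closest center chosen so far
    d_min = [dist(points[start], p) for p in points]
    while len(centers) < k:
        nxt = max(range(len(points)), key=lambda i: d_min[i])
        centers.append(nxt)
        d_min = [min(d_min[i], dist(points[nxt], p)) for i, p in enumerate(points)]
    return centers

points = [0.0, 1.0, 10.0, 11.0, 20.0]
print(farthest_first(points, lambda a, b: abs(a - b), 3))  # [0, 4, 2]
```

Each new center requires the distances from that center to all points, so the traversal uses exactly k one versus all queries.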



2. Preliminaries 



Given a metric space $M = (X, d)$ with point set $X$, an unknown distance function $d$ satisfying
the triangle inequality, and a set of points $S \subseteq X$, we would like to find a k-clustering
$C$ that partitions the points in $S$ into k sets $C_1, \ldots, C_k$ by using one versus all distance
queries.

In our analysis we assume that $S$ satisfies the $(c, \epsilon)$-property of Balcan et al. (2009)
for the k-median objective function. The k-median objective is to minimize
$\Phi(C) = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, c_i)$, where $c_i$ is the median of cluster $C_i$, which is the
point $y \in C_i$ that minimizes $\sum_{x \in C_i} d(x, y)$. Let $\mathrm{OPT}_\Phi = \min_C \Phi(C)$, where the minimum
is over all k-clusterings of $S$, and denote by $C^* = \{C_1^*, \ldots, C_k^*\}$ a clustering achieving this value.
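
As a small illustration of the objective just defined, the following sketch evaluates the k-median cost of a given clustering by brute force. The function name and the example data are assumptions; the median of each cluster is found by trying every point of the cluster, which needs all pairwise distances within the cluster and is only meant to make the definition concrete.

```python
def k_median_cost(clusters, dist):
    """Sum over clusters of the total distance to the best median inside the cluster."""
    total = 0.0
    for cluster in clusters:  # each cluster is assumed to be non-empty
        total += min(sum(dist(x, y) for x in cluster) for y in cluster)
    return total

clusters = [[0.0, 1.0, 2.0], [10.0, 14.0]]
print(k_median_cost(clusters, lambda a, b: abs(a - b)))  # 6.0
```

In the example the first cluster has median 1.0 with cost 2, and the second has cost 4 for either choice of median.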

To formalize the $(c, \epsilon)$-property we need to define a notion of distance between two
k-clusterings $C = \{C_1, \ldots, C_k\}$ and $C' = \{C'_1, \ldots, C'_k\}$. As in (Balcan et al., 2009), we define
the distance between $C$ and $C'$ as the fraction of points on which they disagree under the
optimal matching of clusters in $C$ to clusters in $C'$:

$$\mathrm{dist}(C, C') = \min_{\sigma \in S_k} \frac{1}{n} \sum_{i=1}^{k} |C_i - C'_{\sigma(i)}|,$$

where $S_k$ is the set of bijections $\sigma: \{1, \ldots, k\} \to \{1, \ldots, k\}$. Two clusterings $C$ and $C'$ are
$\epsilon$-close if $\mathrm{dist}(C, C') < \epsilon$.
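
The distance between two clusterings can be computed directly from this definition. The sketch below is a brute-force illustration under the assumption that clusters are given as Python sets and that k is small; it enumerates all k! matchings, whereas a practical implementation would more likely solve the matching as an assignment problem. The function name is hypothetical.

```python
from itertools import permutations

def clustering_distance(c1, c2, n):
    """Fraction of the n points on which two k-clusterings disagree
    under the best matching of clusters."""
    k = len(c1)
    best = n
    for sigma in permutations(range(k)):
        # points of C_i that are missing from the matched cluster C'_sigma(i)
        mismatched = sum(len(c1[i] - c2[sigma[i]]) for i in range(k))
        best = min(best, mismatched)
    return best / n

c1 = [{1, 2, 3}, {4, 5, 6}]
c2 = [{4, 5}, {1, 2, 3, 6}]
print(clustering_distance(c1, c2, 6))  # one point (6) is misplaced, so 1/6
```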

We assume that there exists some unknown relevant "target" clustering $C_T$ and given a
proposed clustering $C$ we define the error of $C$ with respect to $C_T$ as $\mathrm{dist}(C, C_T)$. Our goal
is to find a clustering of low error.

The (c, e)-property is defined as follows. 

Definition 1 We say that the instance $(S, d)$ satisfies the $(c, \epsilon)$-property for the k-median
objective function with respect to the target clustering $C_T$ if any clustering of $S$ that
approximates $\mathrm{OPT}_\Phi$ within a factor of c is $\epsilon$-close to $C_T$, that is,
$\Phi(C) \le c \cdot \mathrm{OPT}_\Phi \Rightarrow \mathrm{dist}(C, C_T) < \epsilon$.

In the analysis of the next section we denote by $c_i^*$ the center point of $C_i^*$, and use
OPT to refer to the value of $C^*$ using the k-median objective, that is, $\mathrm{OPT} = \Phi(C^*)$. We
define the weight of point $x$ to be the contribution of $x$ to the k-median objective in $C^*$:
$w(x) = \min_i d(x, c_i^*)$. Similarly, we use $w_2(x)$ to denote $x$'s distance to the second-closest
cluster center among $\{c_1^*, c_2^*, \ldots, c_k^*\}$. In addition, let $w$ be the average weight of the points:
$w = \frac{1}{n} \sum_{x \in S} w(x) = \frac{\mathrm{OPT}}{n}$, where $n$ is the cardinality of $S$.

3. Clustering With Limited Distance Information 



Algorithm 1 Landmark-Clustering($S$, $\alpha$, $\epsilon$, $\delta$, $k$)

    $b = (1 + 17/\alpha)\epsilon n$;
    $q = 2b$;
    $iter = 4k + 16 \ln\frac{1}{\delta}$;
    $s_{min} = b + 1$;
    $n' = n - b$;
    $L$ = Landmark-Selection($q$, $iter$);
    $C'$ = Expand-Landmarks($s_{min}$, $n'$, $L$);
    Choose some landmark $l_i$ from each cluster $C'_i$;
    for each $x \in S$ do
        Insert $x$ into the cluster $C''_j$ for $j = \mathrm{argmin}_i\, d(x, l_i)$;
    end for
    return $C''$;
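
To give a feel for the magnitudes involved, the parameter settings in the first lines of Algorithm 1 can be evaluated numerically. The helper below is only an illustration; its name, the dictionary keys, and the example values are assumptions. The algorithm as stated leaves these quantities real-valued, so any rounding to integers would be an implementation choice.

```python
import math

def landmark_parameters(n, alpha, eps, delta, k):
    """Evaluate the parameter settings of Algorithm 1 for concrete inputs."""
    b = (1 + 17 / alpha) * eps * n
    return {
        "b": b,                                    # bound on the number of bad points
        "q": 2 * b,                                # sample among the q furthest points
        "iter": 4 * k + 16 * math.log(1 / delta),  # number of landmarks
        "s_min": b + 1,                            # minimum ball size
        "n_prime": n - b,                          # points that must be clustered
    }

params = landmark_parameters(n=10000, alpha=1.0, eps=0.001, delta=0.05, k=5)
print(params)
```

With these hypothetical inputs, b is about 180 and the number of landmarks is about 68, which is far fewer than the 10000 points in the data set.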



In this section we present a new algorithm that accurately clusters a set of points
assuming that the clustering instance satisfies the $(c, \epsilon)$-property for $c = 1 + \alpha$, and the
clusters in the target clustering $C_T$ are not too small. The algorithm presented here is
much faster than the one given by Balcan et al., and does not require all pairwise distances
as input. Instead, we only require $O(k + \ln\frac{1}{\delta})$ one versus all distance queries to achieve the
same performance guarantee as in (Balcan et al., 2009) with probability at least $1 - \delta$.

Our clustering method is described in Algorithm 1. We start by using the Landmark-Selection
procedure to adaptively select a small set of landmarks. This procedure repeatedly
chooses uniformly at random one of the $q$ furthest points from the ones selected so far, for
an appropriate $q$. We use $d_{min}(s)$ to refer to the minimum distance between $s$ and any point
selected so far. Each time we select a new landmark $l$, we use a one versus all distance query
to get the distances between $l$ and all other points in the dataset, and update $d_{min}(s)$ for
each point $s \in S$. To select a new landmark in each iteration, we choose a random number
$i \in \{n - q + 1, \ldots, n\}$ and use a linear time selection algorithm to select the point with the
$i$th smallest value of $d_{min}$, which is one of the $q$ furthest points. We note that our algorithm
only uses the distances between landmarks and other points to produce a clustering.



Algorithm 2 Landmark-Selection($q$, $iter$)

    Choose $l \in S$ uniformly at random;
    $L = \{l\}$;
    for each $d(l, s) \in$ QUERY-ONE-VS-ALL($l$, $S$) do
        $d_{min}(s) = d(l, s)$;
    end for
    for $i = 1$ to $iter - 1$ do
        Let $s_1, \ldots, s_n$ be an ordering of the points in $S$ such that
            $d_{min}(s_j) \le d_{min}(s_{j+1})$ for $j \in \{1, \ldots, n - 1\}$;
        Choose $l \in \{s_{n-q+1}, \ldots, s_n\}$ uniformly at random;
        $L = L \cup \{l\}$;
        for each $d(l, s) \in$ QUERY-ONE-VS-ALL($l$, $S$) do
            if $d(l, s) < d_{min}(s)$ then
                $d_{min}(s) = d(l, s)$;
            end if
        end for
    end for
    return $L$;
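
A direct transcription of Landmark-Selection into Python might look as follows. It is a sketch under several assumptions: points are identified by indices 0 to n-1, `query_one_vs_all` is a caller-supplied (hypothetical) function returning the n distances from one point, `q` is an integer between 1 and n, and a full sort is used in place of the linear time selection algorithm for brevity.

```python
import random

def landmark_selection(n, query_one_vs_all, q, iters, rng=random):
    """Select `iters` landmarks; each new one is drawn uniformly at random
    from the q points currently furthest from the chosen landmarks."""
    first = rng.randrange(n)
    landmarks = [first]
    d_min = list(query_one_vs_all(first))
    for _ in range(iters - 1):
        # indices in ascending order of d_min; the last q are the q furthest points
        order = sorted(range(n), key=lambda s: d_min[s])
        l = rng.choice(order[n - q:])
        landmarks.append(l)
        for s, d in enumerate(query_one_vs_all(l)):
            if d < d_min[s]:
                d_min[s] = d
    return landmarks

# Toy usage on three well-separated groups of points on a line.
pts = [0, 1, 2, 100, 101, 102, 200, 201, 202]
oracle = lambda l: [abs(pts[l] - p) for p in pts]
print(landmark_selection(len(pts), oracle, q=1, iters=3, rng=random.Random(0)))
```

The procedure issues exactly one one versus all query per landmark. With q = 1 it reduces to farthest first traversal from a random starting point.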



Expand-Landmarks then expands a ball $B_l$ around each landmark $l \in L$ chosen by
Landmark-Selection. We use the variable $r$ to denote the radius of all the balls:
$B_l = \{s \in S \mid d(s, l) \le r\}$. The algorithm starts with $r = 0$, and increments it until the balls satisfy a
property described below. For each $B_l$ there are $n$ relevant values of $r$ to try, each adding
one more point to $B_l$, which results in at most $|L|n$ values to try in total.

The algorithm maintains a graph $G_B = (V_B, E_B)$, where vertices correspond to balls
that have at least $s_{min}$ points in them, and two vertices are connected by an (undirected)
edge if the corresponding balls overlap on any point: $(v_{l_1}, v_{l_2}) \in E_B$ iff $B_{l_1} \cap B_{l_2} \ne \emptyset$. In
addition, we maintain the set of points in these balls, $\mathrm{Clustered} = \{s \in S \mid \exists l : s \in B_l\}$,
and a list of the connected components of $G_B$, which we refer to as
$\mathrm{Components}(G_B) = \{\mathrm{Comp}_1, \ldots, \mathrm{Comp}_m\}$.

In each iteration, after we expand one of the balls by a point, we update $G_B$, $\mathrm{Components}(G_B)$,
and Clustered. If $G_B$ has exactly k components, and $|\mathrm{Clustered}| \ge n'$, we terminate and
report points in balls that are part of the same component in $G_B$ as distinct clusters. If this
condition is never satisfied, we report no-cluster. A sketch of the algorithm is given below.
We use $(l^*, s^*)$ to refer to the next landmark-point pair that is considered, corresponding
to expanding $B_{l^*}$ to include $s^*$ (Figure 1).

Figure 1: Balls around landmarks are displayed, with the next point to be added to a ball
labeled as $s^*$.



Algorithm 3 Expand-Landmarks($s_{min}$, $n'$, $L$)

1: while (($l^*$, $s^*$) = Expand-Ball()) != null do
2:     $r = d(l^*, s^*)$;
3:     update $G_B$, $\mathrm{Components}(G_B)$, and Clustered
4:     if $|\mathrm{Components}(G_B)| = k$ and $|\mathrm{Clustered}| \ge n'$ then
5:         return $C = \{C_1, \ldots, C_k\}$ where $C_i = \{s \in S \mid \exists l : s \in B_l$ and $v_l \in \mathrm{Comp}_i\}$.
6:     end if
7: end while
8: return no-cluster;
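
One possible way to realize Expand-Landmarks is sketched below. It is an illustration rather than the implementation used in our experiments: the function name, the `dists` dictionary (landmark index to its list of queried distances), and the use of a union-find structure for the components of $G_B$ are all assumptions. The sketch sorts all landmark-point pairs by distance, and it tests the stopping condition only after every pair at the current radius has been added, so that each ball always equals the set of points within distance r of its landmark.

```python
def expand_landmarks(dists, k, s_min, n_prime):
    """dists maps each landmark id to its list of distances to all n points."""
    n = len(next(iter(dists.values())))
    pairs = sorted((d, l, s) for l, row in dists.items() for s, d in enumerate(row))
    parent = {}                          # union-find over balls that are vertices of G_B
    balls = {l: set() for l in dists}
    containing = [[] for _ in range(n)]  # balls that currently contain each point
    clustered = set()

    def find(l):
        while parent[l] != l:
            parent[l] = parent[parent[l]]
            l = parent[l]
        return l

    def link(l, s):
        # connect ball l to every ball in G_B that also contains point s
        for other in containing[s]:
            if other in parent:
                parent[find(other)] = find(l)

    for idx, (d, l, s) in enumerate(pairs):
        balls[l].add(s)
        containing[s].append(l)
        if l in parent:
            clustered.add(s)
            link(l, s)
        elif len(balls[l]) >= s_min:
            parent[l] = l                # ball l becomes a vertex of G_B
            for t in balls[l]:
                clustered.add(t)
                link(l, t)
        if idx + 1 < len(pairs) and pairs[idx + 1][0] == d:
            continue                     # finish all pairs at radius d before testing
        roots = {find(v) for v in parent}
        if len(roots) == k and len(clustered) >= n_prime:
            comps = {r: set() for r in roots}
            for v in parent:
                comps[find(v)] |= balls[v]
            return list(comps.values())
    return None                          # corresponds to no-cluster

dists = {0: [0, 1, 2, 100, 101, 102], 3: [100, 99, 98, 0, 1, 2]}
print(expand_landmarks(dists, k=2, s_min=2, n_prime=5))
```

Sorting the $|L|n$ pairs dominates the running time of this sketch, which is in line with the $O(|L||S| \log |S|)$ bound stated in the introduction.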



The last step of our algorithm takes the clustering $C'$ returned by Expand-Landmarks
and improves it. We compute a set $L'$ that contains exactly one landmark from each
cluster $C'_i \in C'$ (any landmark is sufficient), and assign each point $x \in S$ to the cluster
corresponding to the closest landmark in $L'$.
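
This reassignment step needs no further distance queries, because the distances from every landmark to every point were already obtained during landmark selection. A minimal sketch, assuming the same hypothetical `dists` dictionary as above and a list `chosen` with one landmark per cluster:

```python
def assign_to_closest_landmark(dists, chosen):
    """dists[l][x] is the queried distance from landmark l to point x;
    chosen lists one landmark per cluster of the intermediate clustering."""
    n = len(dists[chosen[0]])
    clusters = [[] for _ in chosen]
    for x in range(n):
        j = min(range(len(chosen)), key=lambda i: dists[chosen[i]][x])
        clusters[j].append(x)
    return clusters

dists = {0: [0, 1, 2, 100, 101, 102], 3: [100, 99, 98, 0, 1, 2]}
print(assign_to_closest_landmark(dists, [0, 3]))  # [[0, 1, 2], [3, 4, 5]]
```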

We now present our main theoretical guarantee for Algorithm 1.

Theorem 2 Given a metric space $M = (X, d)$, where $d$ is unknown, and a set of points $S$,
if the instance $(S, d)$ satisfies the $(1 + \alpha, \epsilon)$-property for the k-median objective function and
if each cluster in the target clustering $C_T$ has size at least $(4 + 51/\alpha)\epsilon n$, then
Landmark-Clustering outputs a clustering that is $\epsilon$-close to $C_T$ with probability at least $1 - \delta$ in time
$O((k + \ln\frac{1}{\delta})|S| \log |S|)$ using $O(k + \ln\frac{1}{\delta})$ one versus all distance queries.

Before we prove the theorem, we will introduce some notation and use an analysis
similar to the one in (Balcan et al., 2009) to argue about the structure of the clustering
instance. Let $\epsilon^* = \mathrm{dist}(C_T, C^*)$. By our assumption that the k-median clustering of $S$
satisfies the $(1 + \alpha, \epsilon)$-property we have $\epsilon^* < \epsilon$. Since each cluster in the target clustering
has at least $(4 + 51/\alpha)\epsilon n$ points, and the optimal k-median clustering $C^*$ differs from the
target clustering by $\epsilon^* n < \epsilon n$ points, each cluster in $C^*$ must have at least $(3 + 51/\alpha)\epsilon n$
points.

Let us define the critical distance $d_{crit} = \frac{\alpha w}{17\epsilon}$. We call a point $x$ good if both $w(x) < d_{crit}$
and $w_2(x) - w(x) \ge 17 d_{crit}$, else $x$ is called bad. In other words, the good points are those
points that are close to their own cluster center and far from any other cluster center. In
addition, we will break up the good points into good sets $X_i$, where $X_i$ is the set of the good
points in the optimal cluster $C_i^*$. So each set $X_i$ is the "core" of the optimal cluster $C_i^*$.
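
The definitions of $d_{crit}$ and of good points can be made concrete with a short sketch. It is purely illustrative: in our setting the optimal centers are unknown and these quantities appear only in the analysis. The function name and the example values are assumptions, and at least two centers are required so that $w_2$ is defined.

```python
def good_points(points, centers, dist, alpha, eps):
    """Return d_crit and the indices of the good points for the given centers."""
    n = len(points)
    w, w2 = [], []
    for x in points:
        ds = sorted(dist(x, c) for c in centers)
        w.append(ds[0])   # distance to the closest center
        w2.append(ds[1])  # distance to the second-closest center
    w_avg = sum(w) / n
    d_crit = alpha * w_avg / (17 * eps)
    good = [i for i in range(n)
            if w[i] < d_crit and w2[i] - w[i] >= 17 * d_crit]
    return d_crit, good

points = [0, 1, 2, 100, 101, 102, 51]
d_crit, good = good_points(points, [1, 101], lambda a, b: abs(a - b),
                           alpha=1.0, eps=0.1)
print(d_crit, good)
```

In this toy instance the point 51 lies halfway between the two centers, so it is bad, while the six points near a center are good.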

Note that the distance between two points $x, y \in X_i$ satisfies $d(x, y) \le d(x, c_i^*) +
d(c_i^*, y) = w(x) + w(y) < 2d_{crit}$. In addition, the distance between any two points in
different good sets is greater than $16 d_{crit}$. To see this, consider a pair of points $x \in X_i$ and
$y \in X_{j \ne i}$. The distance from $x$ to $y$'s cluster center $c_j^*$ is at least $17 d_{crit}$. By the triangle
inequality, $d(x, y) \ge d(x, c_j^*) - d(y, c_j^*) > 17 d_{crit} - d_{crit} = 16 d_{crit}$.

If the k-median instance $(M, S)$ satisfies the $(1 + \alpha, \epsilon)$-property with respect to $C_T$, and
each cluster in $C_T$ has size at least $2\epsilon n$, then

1. less than $(\epsilon - \epsilon^*)n$ points $x \in S$ on which $C_T$ and $C^*$ agree have $w_2(x) - w(x) < \frac{\alpha w}{\epsilon}$, and

2. at most $17\epsilon n/\alpha$ points $x \in S$ have $w(x) \ge \frac{\alpha w}{17\epsilon}$.



The first part is proved by Balcan et al. (2009). The intuition is that if too many
points on which $C_T$ and $C^*$ agree are close enough to the second-closest center among
$\{c_1^*, c_2^*, \ldots, c_k^*\}$, then we can move them to the clusters corresponding to those centers,
producing a clustering that is far from $C_T$, but whose objective value is close to OPT,
violating the $(1 + \alpha, \epsilon)$-property. The second part follows from the fact that
$\sum_{x \in S} w(x) = \mathrm{OPT} = wn$: if more than $17\epsilon n/\alpha$ points had weight at least $\frac{\alpha w}{17\epsilon}$, their total
weight alone would exceed $wn$.

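Then using these facts and the definition of $\epsilon^*$ it follows that at most $\epsilon^* n + (\epsilon - \epsilon^*)n +
17\epsilon n/\alpha = \epsilon n + 17\epsilon n/\alpha = (1 + 17/\alpha)\epsilon n = b$ points are bad. Hence, writing $B$ for the set
of bad points, each $|X_i| = |C_i^* \setminus B| \ge (2 + 34/\alpha)\epsilon n = 2b$.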
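
Spelled out, the bound on the size of each good set combines the lower bound of $(3 + 51/\alpha)\epsilon n$ on the size of each optimal cluster with the bound $b$ on the number of bad points (here $B$ denotes the set of bad points, a name used only for this derivation):

```latex
\begin{align*}
|X_i| = |C_i^* \setminus B|
  &\ge |C_i^*| - |B| \\
  &\ge (3 + 51/\alpha)\epsilon n - (1 + 17/\alpha)\epsilon n \\
  &= (2 + 34/\alpha)\epsilon n = 2b.
\end{align*}
```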

In the remainder of this section we prove that given this structure of the clustering
instance, Landmark-Clustering finds an accurate clustering. We first show that with high
probability the set of landmarks returned by Landmark-Selection has the property that each of
the cluster cores has a landmark near it. We then argue that given a set of landmarks
with this property, Expand-Landmarks finds a partition $C'$ that clusters most of the points
in each core correctly. We conclude with the proof of the theorem, which argues that the
clustering returned by the last step of our procedure is a further improved clustering that
is very close to $C^*$ and $C_T$.

The Landmark-Clustering algorithm first uses Landmark-Selection($q$, $iter$) to choose a
set of landmark points. The following lemma proves that for an appropriate choice of $q$
after selecting only $iter = O(k + \ln\frac{1}{\delta})$ landmarks with probability at least $1 - \delta$ one of them
is closer than $2d_{crit}$ to some point in each good set.



Lemma 3 Given $L$ = Landmark-Selection($2b$, $4k + 16 \ln\frac{1}{\delta}$), with probability at least $1 - \delta$
there is a landmark closer than $2d_{crit}$ to some point in each good set.
Proof Because there are at most $b$ bad points and in each iteration we uniformly at random
choose one of $2b$ points, the probability that a good point is added to $L$ is at least 1/2 in
each iteration. Using a Chernoff bound we show that the probability that fewer than k good
points have been added to $L$ after $t > 2k$ iterations is less than $e^{-t(1 - \frac{2k}{t})^2/4}$ (Lemma 4).
For $t = 4k + 16 \ln\frac{1}{\delta}$

$$e^{-t(1 - \frac{2k}{t})^2/4} < e^{-(4k + 16\ln\frac{1}{\delta})\, 0.5^2/4} < e^{-16\ln\frac{1}{\delta}/16} = \delta.$$

Therefore after $t = 4k + 16 \ln\frac{1}{\delta}$ iterations this probability is smaller than $\delta$.

We argue that once we select k good points using our procedure, one of them must be
closer than $2d_{crit}$ to some point in each good set. Note that the selected good points must
be distinct because we must have chosen at least k good points after $b + k$ iterations and we
cannot choose the same point twice in the first $n - 2b$ iterations. There are two possibilities
regarding the first k good points added to $L$: they are either selected from distinct good
sets, or at least two of them are selected from the same good set.

If the former is true then the statement trivially holds. If the latter is true, consider
the first time that a second point is chosen from the same good set $X_i$. Let us call these
two points $x$ and $y$, and assume that $y$ is chosen after $x$. The distance between $x$ and
$y$ must be less than $2d_{crit}$ because they are in the same good set. Therefore when $y$ is
chosen, $\min_{l \in L} d(l, y) \le d(x, y) < 2d_{crit}$. Moreover, $y$ is chosen from $\{s_{n-2b+1}, \ldots, s_n\}$, where
$\min_{l \in L} d(l, s_j) \le \min_{l \in L} d(l, s_{j+1})$. Therefore when $y$ is chosen, at least $n - 2b + 1$ points
$s \in S$ (including $y$) satisfy $\min_{l \in L} d(l, s) \le \min_{l \in L} d(l, y) < 2d_{crit}$. Since each good set
satisfies $|X_i| \ge 2b$, it follows that there must be a landmark closer than $2d_{crit}$ to some point
in each good set. ■



Lemma 4 The probability that fewer than k good points have been chosen as landmarks
after $t > 2k$ iterations of Landmark-Selection is less than $e^{-t(1 - \frac{2k}{t})^2/4}$.

Proof Let $X_i$ be an indicator random variable defined as follows: $X_i = 1$ if the point chosen
in iteration $i$ is a good point, and 0 otherwise. Let $X = \sum_{i=1}^{t} X_i$, and $\mu$ be the expectation
of $X$. In other words, $X$ is the number of good points chosen after $t$ iterations of the
algorithm, and $\mu$ is its expected value.

Because in each round we uniformly at random choose one of $2b$ points and there are
at most $b$ bad points in total, $E[X_i] \ge 1/2$ and hence $\mu \ge t/2$. By the Chernoff bound, for
any $\delta > 0$, $\Pr[X < (1 - \delta)\mu] < e^{-\mu\delta^2/2}$.

If we set $\delta = 1 - \frac{2k}{t}$, we have $(1 - \delta)\mu = (1 - (1 - \frac{2k}{t}))\mu \ge (1 - (1 - \frac{2k}{t}))\,t/2 = k$. Assuming
that $t > 2k$, it follows that $\Pr[X < k] \le \Pr[X < (1 - \delta)\mu] < e^{-\mu\delta^2/2} = e^{-\mu(1 - \frac{2k}{t})^2/2} \le
e^{-(t/2)(1 - \frac{2k}{t})^2/2}$. ■
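
As a quick numerical sanity check of how this bound is used in Lemma 3, the following sketch evaluates it at $t = 4k + 16 \ln\frac{1}{\delta}$ for a few hypothetical values of k and $\delta$ and confirms that it falls below $\delta$. The function names and the chosen values are assumptions made for illustration.

```python
import math

def failure_bound(t, k):
    """Upper bound exp(-t (1 - 2k/t)^2 / 4) from Lemma 4; requires t > 2k."""
    return math.exp(-t * (1 - 2 * k / t) ** 2 / 4)

def iterations(k, delta):
    """Number of landmarks used by Landmark-Clustering."""
    return 4 * k + 16 * math.log(1 / delta)

for k, delta in [(5, 0.05), (50, 0.01), (500, 0.001)]:
    t = iterations(k, delta)
    print(k, delta, round(t, 1), failure_bound(t, k) < delta)
```

In these examples the bound is far smaller than $\delta$, which suggests the constants 4 and 16 are conservative.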

The algorithm then uses the Expand-Landmarks procedure to find a k-clustering $C'$. The
following lemma states that $C'$ is an accurate clustering, and has an additional property
that is relevant for the last part of the algorithm.

Figure 2: Balls $B_i$ and $B_j$ of radius $r^*$ are shown, which contain good sets $X_i$ and $X_j$,
respectively. The radius of the balls is small in comparison to the distance between the
good sets.



Lemma 5 Given a set of landmarks $L$ chosen by Landmark-Selection so that the condition
in Lemma 3 is satisfied, Expand-Landmarks($b + 1$, $n - b$, $L$) returns a k-clustering $C' =
\{C'_1, C'_2, \ldots, C'_k\}$ in which each cluster contains points from a distinct good set $X_i$. If we let
$\sigma$ be a bijection mapping each good set $X_i$ to the cluster $C'_{\sigma(i)}$ containing points from $X_i$,
the distance between $c_i^*$ and any landmark $l$ in $C'_{\sigma(i)}$ satisfies $d(c_i^*, l) < 5d_{crit}$.

Proof Lemma 6 argues that since the good sets $X_i$ are well-separated, for $r < 4d_{crit}$ no ball
of radius $r$ can overlap more than one $X_i$, and two balls that overlap different $X_i$ cannot
share any points. Moreover, since we only consider balls that have more than $b$ points in
them, and the number of bad points is at most $b$, each ball in $G_B$ must overlap some good
set. Lemma 7 argues that since there is a landmark near each good set, there is a value of
$r^* < 4d_{crit}$ such that each $X_i$ is contained in some ball around a landmark of radius $r^*$. We
can use these facts to argue for the correctness of the algorithm.

First we observe that for $r = r^*$, $G_B$ has exactly k components and each good set $X_i$
is contained within a distinct component. Each ball in $G_B$ overlaps with some $X_i$, and
by Lemma 6, since $r^* < 4d_{crit}$, we know that each ball in $G_B$ overlaps with exactly one
$X_i$. From Lemma 6 we also know that balls that overlap different $X_i$ cannot share any
points and are thus not connected in $G_B$. Therefore balls that overlap different $X_i$ will be
in different components in $G_B$. Moreover, by Lemma 7 each $X_i$ is contained in some ball
of radius $r^*$. For each good set $X_i$ let us designate by $B_i$ a ball that contains all the points
in $X_i$ (Figure 2), which is in $G_B$ since the size of each good set satisfies $|X_i| > b$. Any ball
in $G_B$ that overlaps $X_i$ will be connected to $B_i$, and will thus be in the same component
as $B_i$. Therefore for $r = r^*$, $G_B$ has exactly k components, one for each good set $X_i$ that
contains all the points in $X_i$.

Since there are at least $n - b$ good points that are in some $X_i$, this means that for $r = r^*$
the number of points that are in some ball in $G_B$ (which are in Clustered) is at least $n - b$.
Hence the condition in line 4 of Expand-Landmarks will be satisfied and the algorithm will
terminate and return a k-clustering in which each cluster contains points from a distinct
good set $X_i$.
Now let us suppose that we start with $r = 0$. Consider the first value of $r = r'$ for
which the condition in line 4 is satisfied. At this point $G_B$ has exactly k components and
the number of points that are not in these components is at most $b$. It must be the case
that $r' \le r^* < 4d_{crit}$ because we know that the condition is satisfied for $r = r^*$, and we
are considering all relevant values of $r$ in ascending order. As before, each ball in $G_B$ must
overlap some good set $X_i$. Again using Lemma 6 we argue that since $r' < 4d_{crit}$, no ball can
overlap more than one $X_i$ and two balls that overlap different $X_i$ cannot share any points.
It follows that each component of $G_B$ contains points from a single $X_i$ (so we cannot merge
the good sets). Moreover, since the size of each good set satisfies $|X_i| > b$, and there are
at most $b$ points left out of $G_B$, each component must contain points from a distinct $X_i$
(so we cannot split the good sets). Thus we will return a k-clustering in which each cluster
contains points from a distinct good set $X_i$.

To prove the second part of the statement, let $\sigma$ be a bijection matching each good set
$X_i$ to the cluster $C'_{\sigma(i)}$ containing points from $X_i$. Clearly, $C'_{\sigma(i)}$ is made up of points in
balls of radius $r < 4d_{crit}$ that overlap $X_i$. Consider any such ball $B_l$ around landmark $l$
and let $s^*$ denote any point on which $B_l$ and $X_i$ overlap. By the triangle inequality, the
distance between $c_i^*$ and $l$ satisfies $d(c_i^*, l) \le d(c_i^*, s^*) + d(s^*, l) < d_{crit} + r < 5d_{crit}$. Therefore
the distance between $c_i^*$ and any landmark $l \in C'_{\sigma(i)}$ satisfies $d(c_i^*, l) < 5d_{crit}$. ■



Lemma 6 A ball of radius $r < 4d_{crit}$ cannot contain points from more than one good set
$X_i$, and two balls of radius $r < 4d_{crit}$ that overlap different $X_i$ cannot share any points.

Proof To prove the first part, consider a ball $B_l$ of radius $r < 4d_{crit}$ around landmark $l$.
In other words, $B_l = \{s \in S \mid d(s, l) \le r\}$. If $B_l$ overlaps more than one good set, then it
must have at least two points from different good sets $x \in X_i$ and $y \in X_j$. By the triangle
inequality it follows that $d(x, y) \le d(x, l) + d(l, y) \le 2r < 8d_{crit}$. However, we know that
$d(x, y) > 16d_{crit}$, giving a contradiction.

To prove the second part, consider two balls $B_{l_1}$ and $B_{l_2}$ of radius $r < 4d_{crit}$ around
landmarks $l_1$ and $l_2$. In other words, $B_{l_1} = \{s \in S \mid d(s, l_1) \le r\}$, and $B_{l_2} = \{s \in S \mid
d(s, l_2) \le r\}$. Assume that they overlap with different good sets $X_i$ and $X_j$: $B_{l_1} \cap X_i \ne \emptyset$
and $B_{l_2} \cap X_j \ne \emptyset$. For the purpose of contradiction, let's assume that $B_{l_1}$ and $B_{l_2}$ share
at least one point: $B_{l_1} \cap B_{l_2} \ne \emptyset$, and use $s^*$ to refer to this point. By the triangle
inequality, it follows that the distance between any point $x \in B_{l_1}$ and $y \in B_{l_2}$ satisfies
$d(x, y) \le d(x, s^*) + d(s^*, y) \le [d(x, l_1) + d(l_1, s^*)] + [d(s^*, l_2) + d(l_2, y)] \le 4r < 16d_{crit}$.

Since $B_{l_1}$ overlaps with $X_i$ and $B_{l_2}$ overlaps with $X_j$, it follows that there is a pair of
points $x \in X_i$ and $y \in X_j$ such that $d(x, y) < 16d_{crit}$, a contradiction. Therefore if $B_{l_1}$ and
$B_{l_2}$ overlap different good sets, $B_{l_1} \cap B_{l_2} = \emptyset$. ■



Lemma 7 Given a set of landmarks $L$ chosen by Landmark-Selection so that the condition
in Lemma 3 is satisfied, there is some value of $r^* < 4d_{crit}$ such that each $X_i$ is contained in
some ball $B_l$ around landmark $l \in L$ of radius $r^*$.
Proof For each good set $X_i$ choose a point $s_i \in X_i$ and a landmark $l_i \in L$ that satisfy
$d(s_i, l_i) < 2d_{crit}$. The distance between $l_i$ and each point $x \in X_i$ satisfies $d(l_i, x) \le d(l_i, s_i) +
d(s_i, x) < 2d_{crit} + 2d_{crit} = 4d_{crit}$.

Consider $r^* = \max_{l_i} \max_{x \in X_i} d(l_i, x)$. Clearly, each $X_i$ is contained in a ball $B_{l_i}$ of radius
$r^*$ and $r^* < 4d_{crit}$. ■



Lemma 8 Suppose the distance between $c_i^*$ and any landmark $l$ in $C'_{\sigma(i)}$ satisfies
$d(c_i^*, l) < 5d_{crit}$. Then given point $x \in C_i^*$ that satisfies $w_2(x) - w(x) > 17d_{crit}$,
for any $l_1 \in C'_{\sigma(i)}$ and $l_2 \in C'_{\sigma(j \ne i)}$ it must be the case that
$d(x, l_1) < d(x, l_2)$.

Proof We will show that $d(x, l_1) < w(x) + 5d_{crit}$ (1), and $d(x, l_2) > w(x) + 12d_{crit}$ (2).
This implies that $d(x, l_1) < d(x, l_2)$.

To prove (1), by the triangle inequality
$d(x, l_1) \le d(x, c_i^*) + d(c_i^*, l_1) = w(x) + d(c_i^*, l_1) < w(x) + 5d_{crit}$.
To prove (2), by the triangle inequality $d(x, c_j^*) \le d(x, l_2) + d(l_2, c_j^*)$. It follows
that $d(x, l_2) \ge d(x, c_j^*) - d(l_2, c_j^*)$. Since $d(x, c_j^*) \ge w_2(x)$ and
$d(l_2, c_j^*) < 5d_{crit}$ we have

$$d(x, l_2) > w_2(x) - 5d_{crit}. \qquad (1)$$

Moreover, since $w_2(x) - w(x) > 17d_{crit}$ we have

$$w_2(x) > 17d_{crit} + w(x). \qquad (2)$$

Combining Equations 1 and 2 it follows that
$d(x, l_2) > 17d_{crit} + w(x) - 5d_{crit} = w(x) + 12d_{crit}$. ■



Proof [Theorem 2] After using Landmark-Selection to choose $O(k + \ln\frac{1}{\delta})$ points, with
probability at least $1 - \delta$ there is a landmark closer than $2d_{crit}$ to some point in each
good set. Given a set of landmarks with this property, each cluster in the clustering
$C' = \{C'_1, C'_2, \ldots, C'_k\}$ output by Expand-Landmarks contains points from a distinct good set
$X_i$. This clustering can exclude up to $b$ points, all of which may be good. Nonetheless, this
means that $C'$ may disagree with $C^*$ on only the bad points and at most $b$ good points.
The number of points that $C'$ and $C^*$ disagree on is therefore at most $2b = O(\epsilon n/\alpha)$.
Thus, $C'$ is at least $O(\epsilon/\alpha)$-close to $C^*$, and at least
$O(\epsilon/\alpha + \epsilon)$-close to $C_T$.

Moreover, $C'$ has an additional property that allows us to find a clustering that is
$\epsilon$-close to $C_T$. If we use $\sigma$ to denote a bijection mapping each good set $X_i$ to the
cluster $C'_{\sigma(i)}$ containing points from $X_i$, any landmark $l \in C'_{\sigma(i)}$ is closer
than $5d_{crit}$ to $c_i^*$. We can use this observation to find all points that satisfy one of the
properties of the good points: points $x$ such that $w_2(x) - w(x) > 17d_{crit}$. Let us call these
points the detectable points. To clarify, the detectable points are those points that are much
closer to their own cluster center than to any other cluster center in $C^*$, and the good points
are a subset of the detectable points that are also very close to their own cluster center.

To find the detectable points using $C'$, we choose some landmark $l_i$ from each $C'_i$. For
each point $x \in S$, we then insert $x$ into the cluster $C''_j$ for $j = \arg\min_i d(x, l_i)$.
Lemma 8 argues that each detectable point in $C_i^*$ is closer to every landmark in
$C'_{\sigma(i)}$ than to any landmark in $C'_{\sigma(j \ne i)}$. It follows that $C''$ and $C^*$ agree
on all the detectable points. Since there are fewer than $(\epsilon - \epsilon^*)n$ points on which
$C_T$ and $C^*$ agree that are not detectable, it follows that
$\mathrm{dist}(C'', C_T) < (\epsilon - \epsilon^*) + \mathrm{dist}(C_T, C^*) = (\epsilon - \epsilon^*) + \epsilon^* = \epsilon$.



Therefore using $O(k + \ln\frac{1}{\delta})$ landmarks we get an accurate clustering with probability
at least $1 - \delta$. The runtime of Landmark-Selection is $O(|L|n)$, where $|L|$ is the number of
landmarks. Using a min-heap to store all landmark-point pairs and a disjoint-set data structure
to keep track of the connected components of $G_B$, Expand-Landmarks can be implemented
in $O(|L|n \log n)$ time. A detailed description of this implementation is given in the next
section. The last part of our procedure takes $O(kn)$ time, so the runtime of our implementation
is $O(|L|n \log n)$. Therefore to get an accurate clustering with probability $1 - \delta$ the runtime
of our algorithm is $O((k + \ln\frac{1}{\delta})n \log n)$. Moreover, we only consider the distances
between the landmarks and other points, so we only use $O(k + \ln\frac{1}{\delta})$ one versus all
distance queries. ■
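
The last step of the procedure in the proof above (pick one landmark per cluster of $C'$, then send every point to the cluster of its nearest chosen landmark) can be sketched as follows. This is a minimal illustration only: the function name, the `dist_rows` layout (one row of distances per landmark, as returned by a one versus all query) and the toy data are assumptions made for the example and do not come from the paper.

```python
def reassign_to_nearest_landmark(dist_rows, representatives):
    """Assign every point to the cluster of its nearest representative landmark.

    dist_rows[l][x] is the distance from landmark l to point x, i.e. the
    answer to the one-versus-all query for l (hypothetical data layout).
    representatives[i] is the landmark chosen from cluster C'_i.
    Returns the clusters C''_1..C''_k as lists of point indices.
    """
    n = len(dist_rows[representatives[0]])
    clusters = [[] for _ in representatives]
    for x in range(n):
        # j = argmin_i d(x, l_i), using only distances to the k chosen landmarks
        j = min(range(len(representatives)),
                key=lambda i: dist_rows[representatives[i]][x])
        clusters[j].append(x)
    return clusters


# Toy example: six points on a line, landmarks are points 0 and 5.
positions = [0, 1, 2, 10, 11, 12]
rows = {0: [abs(p - 0) for p in positions],
        5: [abs(p - 12) for p in positions]}
print(reassign_to_nearest_landmark(rows, [0, 5]))
```

The loop touches $k$ distances per point, which matches the $O(kn)$ cost stated for this step.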



4. Implementation of Expand-Landmarks 



In order to efficiently expand balls around landmarks, we build a min-heap $H$ of landmark-point
pairs $(l, s)$, where the key of each pair is the distance between $l$ and $s$. In each iteration
we find $(l^*, s^*) = H$.deleteMin(), and then add $s^*$ to items($l^*$), which stores the points in
$B_{l^*}$. We store points that have been clustered (points in balls of size larger than $s_{min}$) in
the set Clustered.



Our implementation assigns each clustered point $s$ to a "representative" landmark, denoted
by $l(s)$. The representative landmark of $s$ is the landmark $l$ of the first large ball $B_l$
that contains $s$. To efficiently update the components of $G_B$, we maintain a disjoint-set data
structure $U$ that contains sets corresponding to the connected components of $G_B$, where
each ball $B_l$ is represented by landmark $l$. In other words, $U$ contains a set
$\{l_1, l_2, \ldots, l_i\}$ iff $B_{l_1}, B_{l_2}, \ldots, B_{l_i}$ form a connected component in $G_B$.
For each large ball $B_l$ our algorithm considers all points $s \in B_l$ and performs
Update-Components($l$, $s$), which works as follows. If $s$ does not have a representative landmark
we assign it to $l$, otherwise $s$ must already be in $B_{l(s)}$, and we assign $B_l$ to the same
component as $B_{l(s)}$. If none of the points in $B_l$ are assigned to other landmarks, it will be
in its own component. A detailed description of the algorithm is given below.
Algorithm 4 Expand-Landmarks(s_min, n', L)

 1: A = ();
 2: for each s ∈ S do
 3:     l(s) = null;
 4:     for each l ∈ L do
 5:         A.add((l, s), d(l, s));
 6:     end for
 7: end for
 8: H = build-heap(A);
 9: for each l ∈ L do
10:     items(l) = ();
11: end for
12: Set Clustered = ();
13: U = ();
14: while H.hasNext() do
15:     (l*, s*) = H.deleteMin();
16:     items(l*).add(s*);
17:     if items(l*).size() == s_min then
18:         Activate(l*);
19:     end if
20:     if items(l*).size() > s_min then
21:         Update-Components(l*, s*);
22:     end if
23:     if Clustered.size() > n' and U.size() ... then
24:         return Format-Clustering();
25:     end if
26: end while
27: return no-cluster;



Algorithm 5 Update-Components(l, s)

1: if l(s) == null then
2:     l(s) = l;
3: else
4:     c1 = U.find(l);
5:     c2 = U.find(l(s));
6:     U.union(c1, c2);
7: end if



Algorithm 6 Activate(l)

1: U.MakeSet(l);
2: for each s ∈ items(l) do
3:     Update-Components(l, s);
4:     Clustered.add(s);
5: end for

Algorithm 7 Format-Clustering()

 1: C = ();
 2: for each Set L in U do
 3:     Set Cluster = ();
 4:     for each l ∈ L do
 5:         for each s ∈ items(l) do
 6:             Cluster.add(s);
 7:         end for
 8:     end for
 9:     C.add(Cluster);
10: end for
11: return C;
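
As an illustration of how Algorithms 4-7 fit together, here is a compact Python sketch that uses a binary heap for $H$ and a small disjoint-set forest for $U$. Several details are assumptions rather than part of the paper: the `dist_rows` layout, the extra parameter `k` and the stopping test `num_sets == k` (the condition on $U$ is cut off in the listing above), the use of `>=` for the comparison with $n'$, and the choice to mark a point as clustered when it joins an already active ball (following the prose description of Clustered). It also assumes `s_min >= 1`.

```python
import heapq


def expand_landmarks(dist_rows, s_min, n_prime, k):
    """Sketch of Expand-Landmarks with a min-heap and a disjoint-set forest.

    dist_rows[l][s]: distance from landmark l to point s (hypothetical layout).
    k: target number of components; the stopping test on U is an assumption.
    Returns a list of clusters (sets of points), or None for "no-cluster".
    """
    heap = [(d, l, s) for l, row in dist_rows.items() for s, d in enumerate(row)]
    heapq.heapify(heap)                      # linear-time heap construction
    items = {l: [] for l in dist_rows}       # points currently in ball B_l
    rep = {}                                 # l(s): representative landmark of s
    clustered = set()
    parent = {}                              # disjoint-set forest U over landmarks
    num_sets = 0

    def find(l):
        while parent[l] != l:
            parent[l] = parent[parent[l]]    # path halving
            l = parent[l]
        return l

    def update_components(l, s):
        nonlocal num_sets
        if s not in rep:
            rep[s] = l
        else:
            a, b = find(l), find(rep[s])
            if a != b:                       # merge the two components
                parent[a] = b
                num_sets -= 1

    def activate(l):
        nonlocal num_sets
        parent[l] = l                        # MakeSet(l)
        num_sets += 1
        for s in items[l]:
            update_components(l, s)
            clustered.add(s)

    def format_clustering():
        groups = {}
        for l in parent:                     # only activated landmarks are in U
            groups.setdefault(find(l), set()).update(items[l])
        return list(groups.values())

    while heap:
        _, l, s = heapq.heappop(heap)        # next closest landmark-point pair
        items[l].append(s)
        if len(items[l]) == s_min:
            activate(l)
        elif len(items[l]) > s_min:
            update_components(l, s)
            clustered.add(s)                 # assumption: s now lies in a large ball
        if len(clustered) >= n_prime and num_sets == k:
            return format_clustering()
    return None


# Toy example: two groups on a line, landmarks are points 0 and 4.
rows = {0: [0, 1, 2, 3, 100, 101, 102, 103],
        4: [100, 99, 98, 97, 0, 1, 2, 3]}
print(expand_landmarks(rows, s_min=2, n_prime=6, k=2))
```

The sketch uses path halving without union by rank for brevity, so its union-find bound is weaker than the inverse Ackermann bound; adding ranks would not change the structure of the code.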



During the execution of the algorithm the connected components of $G_B$ correspond to
the sets of $U$ (where each ball $B_l$ is represented by landmark $l$). Suppose that $B_{l_1}$ and
$B_{l_2}$ are connected in $G_B$, then $B_{l_1}$ and $B_{l_2}$ must overlap on some point $s$. Without
loss of generality, suppose $s$ is added to $B_{l_1}$ before it is added to $B_{l_2}$. When $s$ is
added to $B_{l_1}$, $l(s) = l_1$ if $s$ does not yet have a representative landmark (lines 1-2 of
Update-Components), or $l(s) = l'$ and both $l_1$ and $l'$ are put in the same set (lines 4-6 of
Update-Components). When $s$ is added to $B_{l_2}$, if $l(s) = l_1$, then $l_1$ and $l_2$ will be put
in the same set. If $l(s) = l'$, $l'$ and $l_2$ will be put in the same set, which also contains $l_1$.

It follows that whenever $B_{l_1}$ and $B_{l_2}$ are in the same connected component in $G_B$, $l_1$
and $l_2$ will be in the same set in $U$. Moreover, if $B_{l_1}$ and $B_{l_2}$ are not in the same
component in $G_B$, then $l_1$ and $l_2$ can never be in the same set in $U$ because both start in
distinct sets (line 1 of Activate), and it is not possible for a set containing $l_1$ to be merged
with a set containing $l_2$.

It takes $O(|L|n)$ time to build $H$ (linear in the size of the heap). Each deleteMin() operation
takes $O(\log(|L|n))$ (logarithmic in the size of the heap), which is equivalent to $O(\log n)$
because $|L| \le n$. If $U$ is implemented by a union-find algorithm, Update-Components takes
amortized time of $O(\alpha(|L|))$, where $\alpha$ denotes the inverse Ackermann function. Moreover,
Update-Components may only be called once for each iteration of the while loop in
Expand-Landmarks (it is either called immediately on $l^*$ and $s^*$ if $B_{l^*}$ is large enough, or it
is called when the ball grows large enough in Activate). All other operations also take time
proportional to the number of landmark-point pairs. So the runtime of this algorithm is
$O(|L|n) + \mathit{iter} \cdot O(\log n + \alpha(|L|))$, where $\mathit{iter}$ is the number of iterations
of the while loop. As the number of iterations is bounded by $|L|n$, and $\alpha(|L|)$ is effectively
constant, this gives a worst-case running time of $O(|L|n \log n)$.



5. Empirical Study 

We use our Landmark-Clustering algorithm to cluster proteins using sequence similarity.
As mentioned in the Introduction, one versus all distance queries are particularly relevant
in this setting because of sequence database search programs such as BLAST (Altschul
et al., 1990) (Basic Local Alignment Search Tool). BLAST aligns the queried sequence to
sequences in the database, and produces a "bit score" for each alignment, which is a measure
of its quality (we invert the bit score to make it a distance). However, BLAST does not
consider alignments with some of the sequences in the database, in which case we assign
distances of infinity to the corresponding sequences. We observe that if we define distances
in this manner they almost form a metric in practice: when we draw triplets of sequences
at random and check the distances between them the triangle inequality is almost always
satisfied. Moreover, BLAST is very successful at detecting sequence homology in large
sequence databases, therefore it is plausible that clustering using these distances satisfies
the $(c, \epsilon)$-property for some relevant clustering $C_T$.
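
The triplet check described above can be sketched in a few lines of Python. The text does not say how the bit score is inverted, so the reciprocal used in `bit_score_to_distance` is an assumption, as are the function names, the number of trials and the toy distance functions; no BLAST call is made here.

```python
import math
import random


def bit_score_to_distance(bit_score):
    """Turn a bit score into a distance.

    Assumption: "invert" is read as taking the reciprocal. A missing
    alignment (None) is mapped to infinity, as described in the text.
    """
    if bit_score is None or bit_score <= 0:
        return math.inf
    return 1.0 / bit_score


def triangle_violation_rate(dist, points, trials=1000, seed=0):
    """Fraction of random triplets that violate the triangle inequality."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        a, b, c = rng.sample(points, 3)
        ab, bc, ac = dist(a, b), dist(b, c), dist(a, c)
        if ab > bc + ac or bc > ab + ac or ac > ab + bc:
            violations += 1
    return violations / trials


# Toy check with a true metric (absolute difference on a line).
pos = [0.0, 1.5, 2.0, 7.0, 9.5, 10.0]
rate = triangle_violation_rate(lambda a, b: abs(pos[a] - pos[b]), list(range(6)))
print(rate)
```

A rate close to zero on the real distances would support treating them as approximately metric, which is what the paragraph above reports.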

We perform experiments on datasets obtained from two classification databases: Pfam
(Finn et al., 2010), version 24.0, October 2009; and SCOP (Murzin et al., 1995), version
1.75, June 2009. Both of these sources classify proteins by their evolutionary relatedness,
therefore we can use their classifications as a ground truth to evaluate the clusterings
produced by our algorithm and other methods.

Pfam classifies proteins using hidden Markov models (HMMs) that represent multiple 
sequence alignments. There are two levels in the Pfam classification hierarchy: family and 
clan. In our clustering experiments we compare with a classification at the family level 
because the relationships at the clan level are less likely to be discerned with sequence 
alignment. In each experiment we randomly select several large families (of size between 
1000 and 10000) from Pfam-A (the manually curated part of the classification), retrieve the 
sequences of the proteins in these families, and use our Landmark-Clustering algorithm to
cluster the dataset. 

SCOP groups proteins on the basis of their 3D structures, so it only classifies proteins 
whose structure is known. Thus the datasets from SCOP are much smaller in size. The 
SCOP classification is also hierarchical: proteins are grouped by class, fold, superfamily, 
and family. We consider the classification at the superfamily level because this seems most 
appropriate given that we are only using sequence information. As with the Pfam data, 
in each experiment we create a dataset by randomly choosing several superfamilies (of size 
between 20 and 200), retrieve the sequences of the corresponding proteins, and use our 
Landmark-Clustering algorithm to cluster the dataset.

Once we cluster a particular dataset, we compare the clustering to the manual
classification using the distance measure from the theoretical part of our work. To find the
fraction of misclassified points under the optimal matching of clusters in $C$ to clusters in $C'$
we solve a minimum weight bipartite matching problem where the cost of matching $C_i$ to
$C'_{f(i)}$ is $|C_i - C'_{f(i)}|/n$. In addition, we compare clusterings to manual classifications
using the F-measure, which was used in another study of clustering protein sequences (Paccanaro
et al., 2006). The F-measure gives a score between 0 and 1, where 1 indicates an exact
match between the two clusterings (see Appendix A). The F-measure has also been used in
other studies (see Cheng et al., 2006), and is related to our notion of distance (Lemma 9 in
Appendix A).
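
To make the matching-based distance concrete, the sketch below computes it by brute force over all one-to-one matchings, which is only practical for a handful of clusters. This is a stand-in chosen to keep the example dependency-free: a real minimum weight bipartite matching solver (for instance the Hungarian algorithm) would replace the permutation loop. The function name and the toy clusterings are illustrative assumptions, and both clusterings are assumed to have the same number of clusters.

```python
from itertools import permutations


def clustering_distance(C, C_prime, n):
    """Fraction of misclassified points under the best one-to-one matching.

    C and C_prime are lists of sets with the same number of clusters;
    n is the total number of points. Brute force over all k! matchings.
    """
    k = len(C)
    if len(C_prime) != k:
        raise ValueError("both clusterings must have the same number of clusters")
    # cost[i][j] = |C_i - C'_j|: points of C_i that C'_j fails to cover
    cost = [[len(Ci - Cj) for Cj in C_prime] for Ci in C]
    best = min(sum(cost[i][perm[i]] for i in range(k))
               for perm in permutations(range(k)))
    return best / n


C = [{0, 1, 2}, {3, 4, 5}]
C_prime = [{3, 4}, {0, 1, 2, 5}]
print(clustering_distance(C, C_prime, 6))  # one point (5) is misclassified
```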



5.1 Choice of Parameters 

To run Landmark-Clustering, we set $k$ using the number of clusters in the ground truth
clustering. For each Pfam dataset we use $40k$ landmarks/queries, and for each SCOP
dataset we use $30k$ landmarks/queries. In addition, our algorithm uses three parameters
($q$, $s_{min}$, $n'$) whose value is set in the proof based on $\alpha$ and $\epsilon$, assuming that
the clustering instance satisfies the $(1 + \alpha, \epsilon)$-property. In practice we must choose
some value for each parameter. In our experiments we set them as a function of the number of
points in the dataset, and the number of clusters. We set $q = 2n/k$, $s_{min} = 0.05n/k$ for Pfam
datasets, and $s_{min} = 0.1n/k$ for SCOP datasets, and $n' = 0.5n$. Since the selection of landmarks
is randomized, for each dataset we perform several clusterings, compare each to the ground
truth, and report the median quality.

Figure 3: Comparing the performance of $k$-means in the embedded space (blue) and
Landmark-Clustering (red) on 10 datasets from Pfam. Datasets 1-10 are created by randomly
choosing 8 families from Pfam of size $s$, $1000 < s < 10000$. (a) Comparison using
the distance measure from the theoretical part of our work (fraction of misclassified points).
(b) Comparison using the F-measure.

Landmark-Clustering is most sensitive to the $s_{min}$ parameter, and will not report a
clustering if $s_{min}$ is too small or too large. We recommend trying several reasonable values
of $s_{min}$, in increasing or decreasing order, until you get a clustering and none of the clusters
are too large. If you get a clustering where one of the clusters is very large, this likely means
that several ground truth clusters have been merged. This may happen because $s_{min}$ is too
small causing balls of outliers to connect different cluster cores, or $s_{min}$ is too large causing
balls in different cluster cores to overlap.

The algorithm is less sensitive to the $n'$ parameter. However, if you set $n'$ too large
some ground truth clusters may be merged, so we recommend using a smaller value
($0.5n < n' < 0.7n$) because all of the points are still clustered during the last step. Again,
for some values of $n'$ the algorithm may not output a clustering, or output a clustering where
some of the clusters are too large. Our algorithm is least sensitive to the $q$ parameter. Using
more landmarks (if you can afford it) can make up for a poor choice of $q$.
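
For reference, the parameter settings reported in this section can be collected in a small helper. The function name, the `"pfam"`/`"scop"` labels and the rounding of $s_{min}$ to an integer are assumptions made for this sketch; the text only gives the formulas.

```python
def choose_parameters(n, k, dataset):
    """Parameter settings reported in the text for n points and k clusters.

    dataset: "pfam" or "scop" (hypothetical labels used only in this sketch).
    """
    landmarks_per_cluster = {"pfam": 40, "scop": 30}[dataset]
    s_min_fraction = {"pfam": 0.05, "scop": 0.1}[dataset]
    return {
        "num_landmarks": landmarks_per_cluster * k,
        "q": 2 * n / k,
        # Rounded because ball sizes are integer counts; the rounding rule
        # is an assumption, not something stated in the text.
        "s_min": max(1, round(s_min_fraction * n / k)),
        "n_prime": 0.5 * n,
    }


print(choose_parameters(8000, 8, "pfam"))
```

Since the section recommends trying several values of $s_{min}$, a caller would typically treat the returned value as a starting point and scan nearby values until a clustering with no oversized cluster is reported.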
5.2 Results

Figure 3 shows the results of our experiments on the Pfam datasets. One can see that for
most of the datasets (other than datasets 7 and 9) we find a clustering that is almost identical
to the ground truth. These datasets are very large, so as a benchmark for comparison we can
only consider algorithms that use a comparable amount of distance information (since we do
not have the full distance matrix). A natural choice is the following algorithm: randomly
choose a set of landmarks $L$, $|L| = d$; embed each point in a $d$-dimensional space using
distances to $L$; use $k$-means clustering in this space (with distances given by the Euclidean
norm). Our embedding scheme is a Lipschitz embedding with singleton subsets (see Tang
and Crovella, 2003), which gives distances with low distortion for points near each other in
a metric space.
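
The baseline just described (embed each point by its distances to $d$ landmarks, then run $k$-means with the Euclidean norm) can be sketched as below. This is a plain Lloyd iteration written for illustration, not the implementation used in the experiments; the function names, the fixed iteration count, the seeded random initialization and the toy data are all assumptions.

```python
import random


def landmark_embedding(dist_rows):
    """dist_rows[j][x] = distance from the j-th landmark to point x.

    Returns one d-dimensional vector per point (d = number of landmarks).
    """
    return [list(col) for col in zip(*dist_rows)]


def kmeans(vectors, k, iters=50, seed=0):
    """Plain Lloyd iteration with squared Euclidean distance."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest center for every vector.
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(v, centers[c])))
                  for v in vectors]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy example: six points on a line, landmarks at positions 0 and 100.
positions = [0, 1, 2, 100, 101, 102]
rows = [[abs(p - 0) for p in positions], [abs(p - 100) for p in positions]]
print(kmeans(landmark_embedding(rows), k=2))
```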

Notice that this procedure uses exactly $d$ one versus all distance queries, so we can set
$d$ equal to the number of queries used by our algorithm. We expect this algorithm to work
well, and Figure 3 shows that it finds reasonable clusterings. Still, the clusterings reported
by this algorithm do not closely match the Pfam classification, showing that our results are
indeed significant.

Figure 4 shows the results of our experiments on the SCOP datasets. These results
are not as good, which is likely because the SCOP classification at the superfamily level is
based on biochemical and structural evidence in addition to sequence evidence. By contrast,
the Pfam classification is based entirely on sequence information. Still, because the SCOP
datasets are much smaller, we can compare our algorithm to methods that require distances
between all the points. In particular, Paccanaro et al. (2006) showed that spectral clustering
using sequence data works well when applied to the proteins in SCOP. Thus we use the
exact method described by Paccanaro et al. (2006) as a benchmark for comparison on the
SCOP datasets. Moreover, other than clustering randomly generated datasets from SCOP,
we also consider the two main examples from Paccanaro et al., which are labeled A and B
in the figure. From Figure 4 we can see that the performance of Landmark-Clustering is
comparable to that of the spectral method, which is very good considering that the algorithm
used by Paccanaro et al. (2006) significantly outperforms other clustering algorithms on this
data. Moreover, the spectral clustering algorithm requires the full distance matrix as input,
and takes much longer to run.



Figure 4: Comparing the performance of spectral clustering (blue) and Landmark-Clustering
(red) on 10 datasets from SCOP. Datasets A and B are the two main examples from Paccanaro
et al. (2006), the other datasets (1-8) are created by randomly choosing 8 superfamilies
from SCOP of size $s$, $20 < s < 200$. (a) Comparison using the distance measure from the
theoretical part of our work (fraction of misclassified points). (b) Comparison using the
F-measure.



5.3 Testing the $(c, \epsilon)$ property

To see whether the $(c, \epsilon)$ property is a reasonable assumption for our data, we look at
whether our datasets have the structure implied by our assumption. We do this by measuring
the separation of the ground truth clusters in our datasets. For each dataset in our
study, we sample some points from each ground truth cluster. We consider whether the
sampled points are more similar to points in the same cluster than to points in other clusters.
More specifically, for each point we record the median within-cluster similarity, and
the maximum between-cluster similarity. If our datasets indeed have well-separated cluster
cores, as implied by our assumption, then for a lot of the points the median within-cluster
similarity should be significantly larger than the maximum between-cluster similarity. We
can see that this is indeed the case for the Pfam datasets. However, this is not typically
the case for the SCOP datasets, where most points have little similarity to the majority
of the points in their ground truth cluster. These observations explain our results on the
two sets of data: we are able to accurately cluster the Pfam datasets, and our algorithm is
much less accurate on the SCOP datasets. The complete results of these experiments can
be found at http://cs-people.bu.edu/kvodski/clusteringProperties/description.html.
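
The separation measurement described in this section can be sketched as follows. The similarity function, the function name and the toy data are assumptions for illustration; in the experiments the similarity would come from sequence alignment scores. The sketch assumes at least two clusters and at least two points in every sampled point's cluster.

```python
from statistics import median


def separation_profile(sim, clusters, sample):
    """For each sampled point return (median within-cluster similarity,
    maximum between-cluster similarity).

    sim(p, q): similarity between two points (hypothetical callable).
    clusters: ground truth clusters as lists of points.
    """
    label = {p: i for i, c in enumerate(clusters) for p in c}
    profile = {}
    for p in sample:
        within = [sim(p, q) for q in clusters[label[p]] if q != p]
        between = [sim(p, q)
                   for i, c in enumerate(clusters) if i != label[p]
                   for q in c]
        profile[p] = (median(within), max(between))
    return profile


# Toy example: similarity decays with distance on a line.
pos = [0, 1, 2, 10, 11, 12]
sim = lambda a, b: 1.0 / (1.0 + abs(pos[a] - pos[b]))
profile = separation_profile(sim, [[0, 1, 2], [3, 4, 5]], sample=[0, 4])
well_separated = sum(w > b for w, b in profile.values()) / len(profile)
print(profile, well_separated)
```

A dataset with well-separated cluster cores should give a large fraction of sampled points whose first value clearly exceeds the second.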



6. Conclusion and Open Questions 



In this work we presented a new algorithm for clustering large datasets with limited
distance information. As opposed to previous settings, our goal was not to approximate some
objective function like the $k$-median objective, but to find clusterings close to the ground
truth. We proved that our algorithm yields accurate clusterings with only a small number
of one versus all distance queries, given a natural assumption about the structure of the
clustering instance. This assumption has been previously analyzed by Balcan et al. (2009),
but in the full distance information setting. By contrast, our algorithm uses only a small
number of queries, it is much faster, and it has effectively the same formal performance
guarantees as the one introduced by Balcan et al. (2009).

To demonstrate the practical use of our algorithm, we clustered protein sequences using
a sequence database search program as the one versus all query. We compared our results to
gold standard manual classifications of protein evolutionary relatedness given in Pfam (Finn
et al., 2010) and SCOP (Murzin et al., 1995). We find that our clusterings are comparable
in accuracy to the classification given in Pfam. For SCOP our clusterings are as accurate
as state of the art methods, which take longer to run and require the full distance matrix
as input.

Our main theoretical guarantee assumes large target clusters. It would be interesting
to design a provably correct algorithm for the case of small clusters as well.

Appendix A.

In this section we reproduce the definition of the F-measure, which is another way to evaluate
the distance between two clusterings. We also show a relationship between our measure of
distance and the F-measure.

A.1 F-measure

The F-measure compares two clusterings $C$ and $C'$ by matching each cluster in $C$ to a cluster
in $C'$ using a harmonic mean of Precision and Recall, and then computing a "per-point"
average. If we match $C_i$ to $C'_j$, Precision is defined as
$P(C_i, C'_j) = \frac{|C_i \cap C'_j|}{|C_i|}$. Recall is defined as
$R(C_i, C'_j) = \frac{|C_i \cap C'_j|}{|C'_j|}$. For $C_i$ and $C'_j$ the harmonic mean of Precision
and Recall is then equivalent to $\frac{2|C_i \cap C'_j|}{|C_i| + |C'_j|}$, which we denote by
$pr(C_i, C'_j)$ to simplify notation. The F-measure is then defined as

$$F(C, C') = \frac{1}{n} \sum_{C_i \in C} |C_i| \max_{C'_j \in C'} pr(C_i, C'_j).$$

Note that this quantity is between 0 and 1, where 1 corresponds to an exact match between
the two clusterings.
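
The definition above translates directly into a few lines of Python. The function name and the toy clusterings are illustrative assumptions; clusters are represented as non-empty sets of points.

```python
def f_measure(C, C_prime):
    """F-measure of clustering C against C_prime (lists of non-empty sets)."""
    n = sum(len(Ci) for Ci in C)
    total = 0.0
    for Ci in C:
        # pr(C_i, C'_j) = 2|C_i ∩ C'_j| / (|C_i| + |C'_j|), maximized over j
        best = max(2 * len(Ci & Cj) / (len(Ci) + len(Cj)) for Cj in C_prime)
        total += len(Ci) * best
    return total / n


C = [{0, 1, 2}, {3, 4, 5}]
C_prime = [{0, 1, 2, 5}, {3, 4}]
print(f_measure(C, C_prime))
```

In this toy example one of six points is misclassified under the best matching, so the distance between the clusterings is $1/6$, and the computed F-measure (about 0.83) is above $1 - 3d/2 = 0.75$.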

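Lemma 9 Given two clusterings $C$ and $C'$, if $\mathrm{dist}(C, C') = d$ then $F(C, C') \ge 1 - 3d/2$.

Proof Denote by $\sigma$ the optimal matching of clusters in $C$ to clusters in $C'$, which achieves
a misclassification of $dn$ points. We show that just considering $pr(C_i, C'_{\sigma(i)})$ for each
$C_i \in C$ achieves an F-measure of at least $1 - 3d/2$:

$$F(C, C') \ge \frac{1}{n} \sum_{C_i \in C} |C_i| \, pr(C_i, C'_{\sigma(i)}) \ge 1 - 3d/2.$$

To see this, for a match of $C_i$ to $C'_{\sigma(i)}$ we denote by $m_i^{(1)}$ the number of points that
are in $C_i$ but not in $C'_{\sigma(i)}$, and by $m_i^{(2)}$ the number of points that are in
$C'_{\sigma(i)}$ but not in $C_i$: $m_i^{(1)} = |C_i - C'_{\sigma(i)}|$, $m_i^{(2)} = |C'_{\sigma(i)} - C_i|$.
Because the total number of misclassified points is $dn$ it follows that

$$\sum_{C_i \in C} m_i^{(1)} = \sum_{C_i \in C} m_i^{(2)} = dn.$$

By definition, $|C_i \cap C'_{\sigma(i)}| = |C_i| - m_i^{(1)}$. Moreover,
$|C'_{\sigma(i)}| = |C'_{\sigma(i)} \cap C_i| + m_i^{(2)} \le |C_i| + m_i^{(2)}$.
It follows that

$$pr(C_i, C'_{\sigma(i)}) = \frac{2(|C_i| - m_i^{(1)})}{|C_i| + |C'_{\sigma(i)}|} \ge \frac{2(|C_i| - m_i^{(1)})}{2|C_i| + m_i^{(2)}} = \frac{2|C_i| + m_i^{(2)}}{2|C_i| + m_i^{(2)}} - \frac{m_i^{(2)} + 2m_i^{(1)}}{2|C_i| + m_i^{(2)}} \ge 1 - \frac{m_i^{(2)} + 2m_i^{(1)}}{2|C_i|}.$$

We can now see that

$$\frac{1}{n} \sum_{C_i \in C} |C_i| \, pr(C_i, C'_{\sigma(i)}) \ge \frac{1}{n} \sum_{C_i \in C} |C_i| \left(1 - \frac{m_i^{(2)} + 2m_i^{(1)}}{2|C_i|}\right) = \frac{1}{n} \sum_{C_i \in C} |C_i| - \frac{1}{2n} \sum_{C_i \in C} \left(m_i^{(2)} + 2m_i^{(1)}\right) = 1 - \frac{3dn}{2n} = 1 - \frac{3d}{2}.$$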